208 research outputs found
Structured, sparse regression with application to HIV drug resistance
We introduce a new version of forward stepwise regression. Our modification
finds solutions to regression problems where the selected predictors appear in
a structured pattern, with respect to a predefined distance measure over the
candidate predictors. Our method is motivated by the problem of predicting
HIV-1 drug resistance from protein sequences. We find that our method improves
the interpretability of drug resistance predictions while achieving predictive
accuracy comparable to standard methods. We also demonstrate our method in a simulation
study and present some theoretical results and connections.
Comment: Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org), DOI: http://dx.doi.org/10.1214/10-AOAS428
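The structured selection idea can be pictured as a greedy forward search whose score rewards fit to the residual but penalizes distance to already-selected predictors. The sketch below is a simplified illustration under that reading, not the authors' exact algorithm; the distance matrix `dist` and trade-off parameter `lam` are assumptions introduced here.

```python
import numpy as np

def structured_forward_stepwise(X, y, dist, k, lam=1.0):
    """Greedy forward selection that favors predictors close (under `dist`)
    to those already chosen.  `dist` is a (p, p) distance matrix over the
    candidate predictors; `lam` trades off fit against structure.
    A simplified sketch, not the paper's exact algorithm."""
    n, p = X.shape
    selected = []
    resid = y - y.mean()
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            xj = X[:, j]
            # correlation of candidate with current residual
            corr = abs(xj @ resid) / (np.linalg.norm(xj) + 1e-12)
            # structure penalty: distance to the nearest selected predictor
            penalty = min(dist[j, s] for s in selected) if selected else 0.0
            score = corr - lam * penalty
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        # refit least squares on the selected set and update the residual
        Xs = X[:, selected]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
    return selected
```

With `lam = 0`, this reduces to ordinary forward stepwise regression; larger `lam` forces the selected predictors into a tighter spatial pattern.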
Correcting for heterogeneity in real-time epidemiological indicators
Auxiliary data sources have become increasingly important in epidemiological
surveillance, as they are often available at a finer spatial and temporal
resolution, larger coverage, and lower latency than traditional surveillance
signals. We describe the problem of heterogeneity in the signals derived from
these data sources, where spatial and/or temporal biases are present. We
present a method that uses a "guiding" signal to correct for
these biases and produce a more reliable signal that can be used for modeling
and forecasting. The method assumes that the heterogeneity can be approximated
by a low-rank matrix and that the temporal heterogeneity is smooth over time.
We also present a hyperparameter selection algorithm to choose the parameters
representing the matrix rank and degree of temporal smoothness of the
corrections. In the absence of ground truth, we use maps and plots to argue
that this method does indeed reduce heterogeneity. Reducing heterogeneity from
auxiliary data sources greatly increases their utility in modeling and
forecasting epidemics.
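One way to picture the correction: treat the discrepancy between the auxiliary and guiding signals as a bias matrix, smooth it in time, and keep only a low-rank approximation. The sketch below is a simplified illustration of that idea, not the paper's estimator; `rank` and `smooth` are stand-ins for the hyperparameters the selection algorithm would tune.

```python
import numpy as np

def correct_heterogeneity(aux, guide, rank=2, smooth=5):
    """Estimate spatial/temporal bias in an auxiliary signal matrix
    (rows = locations, cols = time) using a guiding signal, assuming the
    bias is low-rank and smooth over time.  A simplified sketch of the
    idea, not the paper's estimator."""
    diff = aux - guide                      # raw discrepancy
    # temporal smoothness assumption: moving average along the time axis
    kernel = np.ones(smooth) / smooth
    smoothed = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, diff)
    # low-rank assumption: truncated SVD of the smoothed discrepancy
    U, s, Vt = np.linalg.svd(smoothed, full_matrices=False)
    bias = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return aux - bias
```

Subtracting only the smoothed, low-rank part of the discrepancy removes systematic spatial/temporal bias while leaving the signal's genuine short-term variation intact.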
Computationally Assisted Quality Control for Public Health Data Streams
Irregularities in public health data streams (like COVID-19 Cases) hamper
data-driven decision-making for public health stakeholders. A real-time,
computer-generated list of the most important, outlying data points from
thousands of daily-updated public health data streams could assist an expert
reviewer in identifying these irregularities. However, existing outlier
detection frameworks perform poorly on this task because they do not account
for the data volume or for the statistical properties of public health streams.
Accordingly, we developed FlaSH (Flagging Streams in public Health), a
practical outlier detection framework for public health data users that uses
simple, scalable models to capture these statistical properties explicitly. In
an experiment where human experts evaluate FlaSH and existing methods
(including deep learning approaches), FlaSH scales to the data volume of this
task, matches or exceeds these other methods in mean accuracy, and identifies
the outlier points that users empirically rate as more helpful. Based on these
results, FlaSH has been deployed on data streams used by public health
stakeholders.
Comment: https://github.com/cmu-delphi/covidcast-indicators/tree/main/_delphi_utils_python/delphi_utils/flash_eva
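To illustrate the kind of simple, scalable per-stream model such a framework relies on, the toy sketch below flags points that deviate strongly from a trailing rolling median, scaled by a robust spread estimate. It illustrates the general approach only; it is not FlaSH's actual models.

```python
import numpy as np

def flag_outliers(stream, window=7, z_thresh=3.0):
    """Flag outlying points in a single count stream by comparing each
    value to a trailing rolling median, scaled by the median absolute
    deviation (MAD).  A toy sketch in the spirit of flagging frameworks
    like FlaSH, not its actual models."""
    stream = np.asarray(stream, dtype=float)
    flags = []
    for t in range(window, len(stream)):
        hist = stream[t - window:t]
        med = np.median(hist)
        mad = np.median(np.abs(hist - med)) + 1e-9
        z = (stream[t] - med) / (1.4826 * mad)  # 1.4826: Gaussian consistency
        if abs(z) > z_thresh:
            flags.append((t, z))
    # most extreme points first, for expert review
    return sorted(flags, key=lambda f: -abs(f[1]))
```

Because each stream is processed independently with O(n) work, a model of this shape scales to thousands of daily-updated streams, and the sorted output gives the reviewer a ranked list rather than a raw dump.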
Flexible Modeling of Epidemics with an Empirical Bayes Framework
Seasonal influenza epidemics cause consistent, considerable, widespread loss
annually in terms of economic burden, morbidity, and mortality. With access to
accurate and reliable forecasts of a current or upcoming influenza epidemic's
behavior, policy makers can design and implement more effective
countermeasures. We developed a framework for in-season forecasts of epidemics
using a semiparametric Empirical Bayes framework, and applied it to predict the
weekly percentage of outpatient doctor visits for influenza-like illness, as
well as the season onset, duration, peak time, and peak height, with and
without additional data from Google Flu Trends, as part of the CDC's 2013-2014
"Predict the Influenza Season Challenge". Previous work on epidemic modeling
has focused on developing mechanistic models of disease behavior and applying
time series tools to explain historical data. However, these models may not
accurately capture the range of possible behaviors that we may see in the
future. Our approach instead produces possibilities for the epidemic curve of
the season of interest using modified versions of data from previous seasons,
allowing for reasonable variations in the timing, pace, and intensity of the
seasonal epidemics, as well as noise in observations. Since the framework does
not make strict domain-specific assumptions, it can easily be applied to other
diseases as well. Another important advantage of this method is that it
produces a complete posterior distribution for any desired forecasting target,
rather than mere point predictions. We report prospective
influenza-like-illness forecasts that were made for the 2013-2014 U.S.
influenza season, and compare the framework's cross-validated prediction error
on historical data to that of a variety of simpler baseline predictors.
Comment: 52 pages
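The core of the approach, generating candidate epidemic curves by perturbing past seasons and weighting them against the observed partial season, can be sketched as importance sampling. The shift and scale ranges and the noise level `sigma` below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def eb_forecast(past_seasons, observed, n_samples=2000, sigma=0.5, rng=None):
    """Empirical-Bayes-style forecasting sketch: build a prior over full
    epidemic curves by randomly shifting and rescaling historical seasons,
    then weight each candidate by how well it matches the weeks observed
    so far.  Returns posterior samples of the full curve.  A simplified
    illustration of the framework, not the paper's implementation."""
    rng = rng or np.random.default_rng(0)
    t_obs = len(observed)
    curves, weights = [], []
    for _ in range(n_samples):
        base = past_seasons[rng.integers(len(past_seasons))]
        shift = rng.integers(-3, 4)       # timing variation (weeks)
        scale = rng.uniform(0.7, 1.3)     # intensity variation
        curve = scale * np.roll(base, shift)
        # likelihood of the observed partial season under Gaussian noise
        resid = observed - curve[:t_obs]
        weights.append(np.exp(-0.5 * np.sum(resid**2) / sigma**2))
        curves.append(curve)
    w = np.array(weights)
    w /= w.sum()
    # importance resampling yields equally-weighted posterior draws
    idx = rng.choice(n_samples, size=n_samples, p=w)
    return np.array(curves)[idx]
```

Any forecasting target (peak height, peak week, season duration) then has a full posterior: compute the target on each resampled curve and summarize, rather than reporting a single point prediction.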
Stream ecosystem responses to an extreme rainfall event across multiple catchments in southeast Alaska
Floods are a key component of the flow regime of many rivers and a major structuring force of stream communities. Climate change is predicted to increase the frequency of extreme rainfall (i.e. return intervals > 100 years) leading to extensive flooding, but the ecological effects of such events are not well understood. Comparative studies of flood impacts are scarce, despite the clear need to understand the potentially contingent responses of multiple independent stream systems to extreme weather occurring at meso- and synoptic spatial scales. We describe the effect of an extreme rainfall event affecting an area >100,000 km2 that caused extensive flooding in SE Alaska. Responses of channel morphology and three key biological groups (meiofauna, macroinvertebrates and fish) were assessed in four separate and recently deglaciated stream catchments of contrasting age (38-180 years) by comparing samples taken before and after the event. Ecological responses to the rainfall and subsequent flooding differed markedly across the four catchments in response to variations in rainfall intensity and to factors such as channel morphology, stream sediment composition and catchment vegetation type and cover, which were themselves related to stream age. Our study demonstrates the value of considering multiple response variables when assessing the effects of extreme events, and highlights the potential for contrasting biological responses to extreme events across catchments. We advocate more comparative studies to understand how extreme rainfall and flooding affect ecosystem responses across multiple catchments.
A probabilistic generative model for GO enrichment analysis
The Gene Ontology (GO) is extensively used to analyze all types of high-throughput experiments. However, researchers still face several challenges when using GO and other functional annotation databases. One problem is the large number of multiple hypotheses that are being tested for each study. In addition, categories often overlap with both direct parents/descendants and other distant categories in the hierarchical structure. This makes it hard to determine if the identified significant categories represent different functional outcomes or rather a redundant view of the same biological processes. To overcome these problems, we developed a generative probabilistic model which identifies a (small) subset of categories that, together, explain the selected gene set. Our model accommodates noise and errors in the selected gene set and GO. Using controlled GO data, our method correctly recovered most of the selected categories, leading to dramatic improvements over current methods for GO analysis. When used with microarray expression data and ChIP-chip data from yeast and human, our method was able to correctly identify both general and specific enriched categories which were overlooked by other methods.
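The selection step can be pictured, in much-simplified form, as a greedy cover of the selected gene set: repeatedly pick the category that explains the most still-unexplained genes, and stop when the gain is small enough to attribute the leftovers to noise. This set-cover sketch stands in for the paper's generative probabilistic model; `min_gain` crudely plays the role of its noise tolerance.

```python
def select_categories(selected_genes, categories, max_k=5, min_gain=2):
    """Greedy sketch: choose a small set of categories that together
    explain the selected gene set, stopping when no category adds enough
    newly-explained genes.  A set-cover-style simplification of the
    paper's generative probabilistic model, not the model itself.
    `categories` maps category name -> set of annotated genes."""
    remaining = set(selected_genes)
    chosen = []
    for _ in range(max_k):
        best, best_gain = None, 0
        for name, genes in categories.items():
            if name in chosen:
                continue
            gain = len(remaining & genes)   # newly-explained genes
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None or best_gain < min_gain:
            break   # leftover genes are attributed to noise
        chosen.append(best)
        remaining -= categories[best]
    return chosen
```

Because a broad parent category adds little once its specific children are chosen (and vice versa), preferring high marginal gain naturally discourages the redundant parent/child reporting the abstract describes.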
Results from the Centers for Disease Control and Prevention's Predict the 2013-2014 Influenza Season Challenge
Background: Early insights into the timing of the start, peak, and intensity of the influenza season could be useful in planning influenza prevention and control activities. To encourage development and innovation in influenza forecasting, the Centers for Disease Control and Prevention (CDC) organized a challenge to predict the 2013-14 United States influenza season. Methods: Challenge contestants were asked to forecast the start, peak, and intensity of the 2013-2014 influenza season at the national level and at any or all Health and Human Services (HHS) region level(s). The challenge ran from December 1, 2013, to March 27, 2014; contestants were required to submit 9 biweekly forecasts at the national level to be eligible. The selection of the winner was based on expert evaluation of the methodology used to make the prediction and the accuracy of the prediction as judged against the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). Results: Nine teams submitted 13 forecasts for all required milestones. The first forecast was due on December 2, 2013; 3/13 forecasts received correctly predicted the start of the influenza season within one week, 1/13 predicted the peak within 1 week, 3/13 predicted the peak ILINet percentage within 1%, and 4/13 predicted the season duration within 1 week. For the prediction due on December 19, 2013, the number of forecasts that correctly predicted the peak week increased to 2/13, the peak percentage to 6/13, and the duration of the season to 6/13. As the season progressed, the forecasts became more stable and were closer to the season milestones. Conclusion: Forecasting has become technically feasible, but further efforts are needed to improve forecast accuracy so that policy makers can reliably use these predictions. CDC and challenge contestants plan to build upon the methods developed during this contest to improve the accuracy of influenza forecasts. © 2016 The Author(s)
Using data-driven rules to predict mortality in severe community acquired pneumonia
Prediction of patient-centered outcomes in hospitals is useful for performance benchmarking, resource allocation, and guidance regarding active treatment and withdrawal of care. Yet, its use by clinicians is limited by the complexity of available tools and the amount of data required. We propose to use Disjunctive Normal Forms as a novel approach to predict hospital and 90-day mortality from instance-based patient data, comprising demographic, genetic, and physiologic information in a large cohort of patients admitted with severe community acquired pneumonia. We develop two algorithms to efficiently learn Disjunctive Normal Forms, which yield easy-to-interpret rules that explicitly map data to the outcome of interest. Disjunctive Normal Forms achieve higher predictive performance than a set of state-of-the-art machine learning models, and unveil insights unavailable with standard methods. Disjunctive Normal Forms constitute an intuitive set of prediction rules that could be easily implemented to predict outcomes and guide criteria-based clinical decision making and clinical trial execution, and are thus of greater practical usefulness than currently available prediction tools. The Java implementation of the tool JavaDNF will be publicly available. © 2014 Wu et al.
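A Disjunctive Normal Form maps directly to an OR of ANDs, which is what makes the learned rules easy to read and to implement. The sketch below shows how such a rule is evaluated; the example rule itself (age, blood pressure, and lactate thresholds) is made up for illustration, not one learned from the pneumonia cohort.

```python
def dnf_predict(record, dnf):
    """Evaluate a Disjunctive Normal Form rule: predict positive if ANY
    conjunction is satisfied, where a conjunction is satisfied only if
    ALL of its literals hold for the given patient record."""
    return any(
        all(literal(record) for literal in conjunction)
        for conjunction in dnf
    )

# Hypothetical example rule, for illustration only: predict mortality if
#   (age > 80 AND systolic BP < 90)  OR  (lactate > 4)
example_dnf = [
    [lambda r: r["age"] > 80, lambda r: r["sbp"] < 90],
    [lambda r: r["lactate"] > 4.0],
]
```

Each conjunction reads as a clinical criterion, so the whole rule can be checked at the bedside without a computer, which is the interpretability advantage the abstract emphasizes over black-box models.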
- …